Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Formalize Barrier behavior during waiting #3464

Closed
wants to merge 20 commits into from

Conversation

p-datadog
Copy link
Contributor

What does this PR do?

This PR modifies the Barrier class to have, in my opinion, more understandable/easier to reason about timeout behavior during the waits.

The new behavior is as follows:

  • A timeout can be defined for the barrier. Any wait will wait up to this timeout from the moment the barrier is created (and its corresponding operation, presumably, started).
  • If a timeout is provided for the individual wait operation, the wait will be no longer than this timeout (but would not end later than barrier creation + barrier timeout).
  • If multiple waits are done, they will continue waiting until the barrier timeout elapses from barrier construction (or until the individual wait timeouts, whichever is shorter).

Previous behavior:

  • A timeout can be defined for the barrier, but this timeout is applied to the wait operations, not to the barrier operation.
  • When waiting, the wait timeout replaces the barrier timeout, even if the wait timeout is longer.
  • The wait is always waiting up to the full selected timeout, even if the barrier was created (and thus barrier operation started) a long time ago.
  • Upon completion of (any) wait, the barrier is marked as having been lifted.

Examples

  1. Start operation with 2 second timeout, after 1 second, wait for it.

New behavior: after 1 second of waiting, control returns to caller.
Previous behavior: after 2 seconds of waiting, control returns to caller.

  1. Start operation with 2 second timeout, after 3 seconds, wait for it.

New behavior: control returns instantly.
Previous behavior: after 2 seconds of waiting, control returns to caller.

  1. Start operation with 2 second timeout, wait for it with 1 second timeout,
    after 1 second second thread waits with 1 second timeout.

New behavior: Each waiting thread waits 1 second, then continues.
Previous behavior: First thread waits 1 second, second thread continues immediately.

Motivation:

The existing behavior is I think confusing and, although probably does not present issues given the class is used in a very limited way by dd-trace-rb, may cause issues if the usage of Barrier was expanded.

Additional Notes:

This PR is part of the fix for thread leaks which will be in a follow-up PR.

How to test the change?

This PR adds several unit tests which check elapsed time of wait operations. The shortest timeouts are set to 0.25 seconds, which is hopefully sufficient to not produce flakiness in CI while keeping the test runs quick.

The entire barrier_spec.rb takes 7 seconds to execute currently.

@p-datadog p-datadog requested a review from a team as a code owner February 16, 2024 18:37
@github-actions github-actions bot added the core Involves Datadog core libraries label Feb 16, 2024
Copy link
Member

@ivoanjo ivoanjo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a few comments!

I think in general this class has been problematic (e.g. see https://github.com/DataDog/ruby-guild/issues/50 and #3002 and #2990 ), so I left several suggestions for further simplifying it, which hopefully would also help in reducing the flakiness.

@once = false
@timeout = timeout
@lifted = false
@deadline = timeout && Core::Utils::Time.get_time + timeout
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I'm not sure this is the correct intended semantics.

From reading the code, my understanding is that the intention is that the timeout gets configured at component initialization time, but the actual timeout would only start counting later, when the worker gets lazily initialized.

I guess cc @lloeki can help clarify :)

Comment on lines +120 to +122
# If timeout is provided in this call, waits up to the smaller of
# the provided wait timeout and the barrier timeout since the
# barrier was created.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a private class, and we never actually need to provide this second timeout in production, I'm wondering if we should remove this feature until we need it.

Comment on lines +124 to +126
# If neither wait timeout is provided in this call nor the
# barrier timeout in the constructor, waits indefinitely until
# the barrier is lifted.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should remove this feature too, since again, the one user of this class doesn't actually use this functionality ;)

Comment on lines +152 to +153
# workaround for rubocop & steep trying to mangle the code
if timeout && timeout.public_send(:<=, 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If Rubocop is being annoying, I suggest using an inline # rubocop:disable instead of making the code "worse" just to make it happy.

I'm curious about the issue with steep -- maybe it's something we can fix in the type signatures? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

steep refuses to permit timeout <= 0 on account of timeout allegedly being nil (the preceding check for it being truthy doesn't count, apparently).

Comment on lines 158 to +167
# - starting with Ruby 3.2, ConditionVariable#wait returns nil on
# timeout and an integer otherwise
# - before Ruby 3.2, ConditionVariable returns itself
# so we have to rely on @once having been set
if RUBY_VERSION >= '3.2'
lifted = @condition.wait(@mutex, timeout)
else
@condition.wait(@mutex, timeout)
lifted = @once
end
# so we have to rely on @lifted having been set
lifted = if RUBY_VERSION >= '3.2'
!!@condition.wait(@mutex, timeout)
else
@condition.wait(@mutex, timeout)
@lifted
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: To be honest, I'm not sure it's worth keeping a multiple line comment + 2 implementations, rather than just using the one implementation that works on all Rubies ;)


@once ||= true
@mutex.synchronize do
@once ||= true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still correct? Should this be @lifted?

end

before do
record
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: I believe this can be simplified by making record a let!

@p-datadog
Copy link
Contributor Author

Thank you for the review @ivoanjo , I discussed this with @lloeki and this PR definitely needs more work.

At a minimum, my assumption of when the "work" process starts was incorrect; to continue in the spirit of this PR I would need to add a method that would start the "timeout from work start". Construction time isn't at all correct to take for the moment when work starts.

Second issue that @lloeki raised was that in the existing implementation, waiting threads wouldn't all be unblocked together, but in the implementation proposed in this PR they would be (as soon as the work timeout expires). Unblocking all threads at the same time was not an appealing idea.

And generally, whether the overall approach taken by this PR was the desired one (i.e. whether timeouts should count from when the work started or from when the waits started), is up for debate.

I believe this PR is orthogonal to the thread leak issue. I opened it because I thought it would make the code easier to reason about but given the feedback perhaps the barrier can be left as is for now and I'll be focusing on the start/stop calls right now.

@p-datadog p-datadog closed this Jul 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Involves Datadog core libraries
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants